Claude Opus 4
This is the most misunderstood graph in AI
To some, METR's "time horizon plot" indicates that AI utopia--or apocalypse--is close at hand. The truth is more complicated. Every time OpenAI, Google, or Anthropic drops a new frontier large language model, the AI community holds its breath. It doesn't exhale until METR, an AI research nonprofit whose name stands for "Model Evaluation & Threat Research," updates a now-iconic graph that has played a major role in the AI discourse since it was first released in March of last year. The graph suggests that certain AI capabilities are developing at an exponential rate, and more recent model releases have outperformed that already impressive trend. That was certainly the case for Claude Opus 4.5, the latest version of Anthropic's most powerful model, which was released in late November.
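The trend behind the plot is simple to state: the length of tasks (measured by how long they take skilled humans) that models can finish at roughly a 50% success rate appears to double on a fairly regular schedule. The snippet below is a minimal sketch of how such a doubling trend extrapolates; the 7-month doubling time and the 60-minute starting horizon are placeholder assumptions for illustration, not METR's fitted values.

```python
# Minimal illustration of how an exponential time-horizon trend extrapolates.
# ASSUMPTIONS: the 7-month doubling time and the 60-minute starting horizon
# are placeholder numbers, not METR's published fit.

DOUBLING_MONTHS = 7.0   # assumed doubling time of the 50%-success time horizon
H0_MINUTES = 60.0       # assumed horizon at month 0 (in human-minutes)

def horizon_minutes(months_elapsed: float) -> float:
    """Length of task (in human-minutes) completable at ~50% success under the trend."""
    return H0_MINUTES * 2 ** (months_elapsed / DOUBLING_MONTHS)

for m in (0, 7, 14, 28, 42):
    print(f"month {m:>2}: ~{horizon_minutes(m):,.0f} min")
# month  0: ~60 min ... month 42: ~3,840 min (about 64 human-hours)
```

A model release "outperforming the trend" simply means its measured horizon lands above the curve this kind of fit predicts for its release date.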
- North America > United States > Massachusetts (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > China (0.04)
How Claude Code Is Reshaping Software--and Anthropic
WIRED spoke with Boris Cherny, head of Claude Code, about how the viral coding tool is changing the way Anthropic works. Engineers in Silicon Valley have been raving about Anthropic's AI coding tool, Claude Code, for months. But recently, the buzz feels as if it's reached a fever pitch. Earlier this week, I sat down with Boris Cherny, head of Claude Code, to try to understand how the company is meeting this moment. "We built the simplest possible thing," said Cherny. "The craziest thing was learning three months ago that half of the sales team at Anthropic uses Claude Code every week."
- North America > United States > California (0.35)
- Asia > China (0.05)
- Europe > Slovakia (0.04)
- (3 more...)
Why the World's Best AI Systems Are Still So Bad at Pokémon
Pillay is an editorial fellow at TIME. Right now, live on Twitch, you can watch three of the world's smartest AI systems--GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro--doing their best to beat classic Pokémon games. At least by human standards, they are not very good. The systems are slow, overconfident, and often confused.
- North America > United States (0.05)
- Europe > France (0.05)
- Asia > China (0.05)
- Africa (0.05)
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- North America > United States > New York > Rensselaer County > Troy (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > South Korea (0.04)
IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
Biyani, Param, Kirtania, Shashank, Bajpai, Yasharth, Gulwani, Sumit, Tiwari, Ashish
We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered, human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models shows that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, and that theorem-proving success rates remain low even with iterative refinement, making IndiMathBench a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
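To make the informal-to-formal pairing concrete, here is a hypothetical example in the style the pipeline targets: an olympiad-flavored statement rendered as a Lean 4 theorem. The statement, names, and the `sorry` placeholder are invented for illustration and are not drawn from the IndiMathBench dataset; supplying the proof is the part evaluated models are asked to do.

```lean
-- Hypothetical informal/formal pair (not from IndiMathBench), for illustration.
-- Informal: "Show that for every natural number n, n^2 + n is even."
import Mathlib

theorem sq_add_self_even (n : ℕ) : Even (n ^ 2 + n) := by
  sorry  -- producing this proof is the theorem-proving task being evaluated
```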
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.68)
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Su, Hongjin, Diao, Shizhe, Lu, Ximing, Liu, Mingjie, Xu, Jiacheng, Dong, Xin, Fu, Yonggan, Belcak, Peter, Ye, Hanrong, Yin, Hongxu, Dong, Yi, Bakhturina, Evelina, Yu, Tao, Choi, Yejin, Kautz, Jan, Molchanov, Pavlo
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
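The abstract does not spell out the reward, but a sketch helps fix ideas: one simplified, hedged reading of an "outcome-, efficiency-, and user-preference-aware" reward is a weighted sum of the three terms. The weights, cost normalization, and function name below are assumptions, not the paper's implementation.

```python
# Simplified, assumed reward shaping for an orchestrator policy; illustrative
# only, not ToolOrchestra's actual formulation.

def orchestration_reward(
    correct: bool,             # outcome: did the final answer pass the check?
    cost_usd: float,           # efficiency: total spend across model/tool calls
    preference_match: float,   # in [0, 1]: how well tool choices matched user preferences
    cost_budget_usd: float = 1.0,
    w_outcome: float = 1.0,
    w_efficiency: float = 0.3,
    w_preference: float = 0.2,
) -> float:
    outcome = 1.0 if correct else 0.0
    efficiency = max(0.0, 1.0 - cost_usd / cost_budget_usd)  # cheaper episodes score higher
    return w_outcome * outcome + w_efficiency * efficiency + w_preference * preference_match

# Example: a correct answer at $0.25 with fully preferred tools
# -> 1.0*1 + 0.3*0.75 + 0.2*1.0 = 1.425
```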
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B
Xu, Sen, Zhou, Yi, Wang, Wei, Min, Jixin, Yin, Zhibin, Dai, Yingwei, Liu, Shixi, Pang, Lianyu, Chen, Yirong, Zhang, Junlin
Challenging the prevailing consensus that small models inherently lack robust reasoning, this report introduces VibeThinker-1.5B, a 1.5B-parameter dense model developed via our Spectrum-to-Signal Principle (SSP). It runs counter to the dominant approach of scaling model parameters to enhance capabilities, as seen in models like DeepSeek R1 (671B) and Kimi k2 (>1T). The SSP framework first employs Two-Stage Diversity-Exploring Distillation (SFT) to generate a broad spectrum of solutions, followed by MaxEnt-Guided Policy Optimization (RL) to amplify the correct signal. With a total training cost of only $7,800, VibeThinker-1.5B demonstrates superior reasoning capabilities compared to closed-source models like Magistral Medium and Claude Opus 4, and performs on par with open-source models like GPT OSS-20B Medium. Remarkably, it surpasses the 400x larger DeepSeek R1 on three math benchmarks: AIME24 (80.3 vs. 79.8), AIME25 (74.4 vs. 70.0), and HMMT25 (50.4 vs. 41.7). This is a substantial improvement over its base model (6.7, 4.3, and 0.6, respectively). On LiveCodeBench V6, it scores 51.1, outperforming Magistral Medium's 50.3 and its base model's 0.0. These findings demonstrate that small models can achieve reasoning capabilities comparable to large models, drastically reducing training and inference costs and thereby democratizing advanced AI research.
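The abstract names MaxEnt-Guided Policy Optimization without defining it, so the following is only a generic sketch of one way an entropy criterion can guide RL data selection: prioritize problems where the model's current pass rate sits nearest 50%, where a pass/fail outcome carries maximal entropy. The function names and ranking rule are assumptions, not the SSP recipe.

```python
# Generic entropy-guided problem ranking; an illustrative assumption, not the
# SSP / MaxEnt-Guided Policy Optimization procedure from the report.
import math

def bernoulli_entropy(p: float) -> float:
    """Entropy (nats) of a pass/fail outcome with pass rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def rank_problems_by_entropy(pass_rates: dict[str, float]) -> list[str]:
    """Problems with pass rates nearest 0.5 yield the most learning signal per rollout."""
    return sorted(pass_rates, key=lambda pid: bernoulli_entropy(pass_rates[pid]), reverse=True)

print(rank_problems_by_entropy({"p1": 0.05, "p2": 0.50, "p3": 0.90}))
# ['p2', 'p3', 'p1'] -- the 50%-solved problem is prioritized
```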
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
Sharrock, Callum, Petersson, Lukas, Petersson, Hanna, Backlund, Axel, Wennström, Axel, Nordström, Kristoffer, Aronsson, Elias
We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench. Language models (LMs) were initially intended for narrow text understanding tasks. The first Transformer-based LM (Vaswani et al., 2017) was explicitly trained for translation. However, large-scale training runs of LMs eventually resulted in emergent behaviour - model capabilities that were not explicitly trained for (Brown et al., 2020). For example, LLMs are not trained to be robots, yet companies such as Figure (Helix, 2025) and Google DeepMind (Gemini Robotics 1.5, 2025) use LLMs in their robotic stack.
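A rough sketch of the hierarchical split described above may help: an LLM planner proposes high-level steps and a separate low-level controller (a VLA model in deployed systems) executes them, with only the planner under evaluation in Butter-Bench. Both functions below are stubs invented here, not the benchmark's harness or prompts.

```python
# Stub sketch of the LLM-planner / low-level-controller split; invented for
# illustration, not Butter-Bench's actual harness.

def llm_plan(task: str, observations: list[str]) -> list[str]:
    """Stand-in for the high-level reasoner (the layer the benchmark evaluates)."""
    return ["locate the butter", "navigate to it", "grasp it", "deliver it to the table"]

def low_level_execute(subgoal: str) -> str:
    """Stand-in for the VLA controller; returns an observation for the planner."""
    return f"completed: {subgoal}"

def run_episode(task: str) -> list[str]:
    observations: list[str] = []
    for subgoal in llm_plan(task, observations):
        observations.append(low_level_execute(subgoal))
    return observations

print(run_episode("bring the butter to the table"))
```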
- Leisure & Entertainment (0.93)
- Information Technology (0.68)
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Press, Ori, Amos, Brandon, Zhao, Haoyu, Wu, Yikai, Ainsworth, Samuel K., Krupke, Dominik, Kidger, Patrick, Sajed, Touqir, Stellato, Bartolomeo, Park, Jisun, Bosch, Nathanael, Meril, Eli, Steppi, Albert, Zharmagambetov, Arman, Zhang, Fangzhao, Perez-Pineiro, David, Mercurio, Alberto, Zhan, Ni, Abramovich, Talor, Lieret, Kilian, Zhang, Hanlin, Huang, Shirley, Bethge, Matthias, Press, Ofir
Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, scikit-learn, and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
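The abstract describes AlgoTuner's loop explicitly enough to sketch: propose an edit, reject it if it fails the correctness checks, time it, and keep the fastest valid version within a fixed budget. The helper callables and the budget below are hypothetical stand-ins, not AlgoTuner's real interfaces.

```python
# Budgeted edit-profile-verify loop in the spirit of the description above.
# propose_edit / passes_tests / run_time are hypothetical stand-ins.

def tune(propose_edit, passes_tests, run_time, initial_code: str, budget: int = 20):
    """Keep the fastest candidate that still passes the task's correctness checks."""
    best_code, best_time = initial_code, run_time(initial_code)
    code = initial_code
    for _ in range(budget):
        code = propose_edit(code)     # e.g. ask an LM for a revised, hopefully faster version
        if not passes_tests(code):    # discard edits that break correctness
            continue
        elapsed = run_time(code)      # profile the candidate on the task's inputs
        if elapsed < best_time:
            best_code, best_time = code, elapsed
    return best_code, best_time
```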
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > Indiana > Marion County > Indianapolis (0.04)
- North America > Canada (0.04)
- (6 more...)
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Bhuiya, Neeladri, Aggarwal, Madhav, Purwar, Diptanshu
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner, and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights into the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for comprehensive model vulnerability evaluation.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > Ukraine > Crimea (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Workflow (1.00)
- Instructional Material (0.86)
- Research Report > New Finding (0.46)
- Personal > Interview (0.46)